Flood susceptibility mapping with ensemble machine learning

A case of Eastern Mediterranean basin, Turkey

Dusty Turner

Authors

  • Huseyin Ozdemir
    • Graduate School of Natural and Applied Sciences, Gazi University, Ankara, Turkey
  • Musteyde Baduna Koçyigit
    • Civil Engineering Department, Faculty of Engineering, Gazi University, Ankara, Turkey
  • Diyar Akay
    • Department of Industrial Engineering, Faculty of Engineering, Hacettepe University, Beytepe, Ankara, Turkey

Purpose of the Paper

What is the Study?

  • Flooding represents a major challenge due to its unpredictable nature, complex influencing factors, and severe impact on infrastructure and communities.

Why is it Important?

  • The study of floods is crucial for understanding their destructive impact on human life, economic systems, and the environment.

What is the Goal?

  • The research aims to create ensemble models that effectively map flood susceptibility in the Eastern Mediterranean Basin.

  • Planners can make decisions based on this knowledge.

Situational Awareness

Mediterranean Basin

  • Population and Area:
    • The basin houses 2.4% of Turkey’s population.
    • Constitutes approximately 3% of the country’s surface area.
  • River Characteristics:
    • Rivers in the basin are short with steep beds, except for the Göksu and Berdan Rivers.
    • No alluvial floor along the river beds; rivers pass through narrow valleys.
  • Geography and Altitude:
    • Average altitude varies between 0–2000 meters.
    • Peaks exceed 3000 meters.
  • Soil Structure:
    • 14 large soil groups determined by precipitation and climate.
    • Common soils include red and red-brown Mediterranean soil, non-calcareous brown and brown forest soil, alluvial, and colluvial soils.

Data Available

Flood Susceptibility Model Features
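Feature                         | Definition                   | Units | Resolution
--------------------------------|------------------------------|-------|-----------
Elevation                       | Height above sea level       | m     | 10 m
Slope                           | Terrain incline              | m     | 10 m
Aspect                          | Slope direction              | m     | 10 m
Profile Curvature               | Slope curvature              | m     | 10 m
SPI (Stream Power Index)        | Water erosive power          | m     | 10 m
STI (Sediment Transport Index)  | Sediment transport potential | m     | 10 m
TWI (Topographic Wetness Index) | Soil moisture indicator      | m     | 10 m
TRI (Terrain Ruggedness Index)  | Terrain heterogeneity        | m     | 10 m
Distance from River             | Proximity to rivers          | m     | 10 m
Drainage Density                | Stream length per area       | m     | 50 m
CN (Curve Number)               | Runoff potential             | -     | 25 m
Rainfall                        | Precipitation amount         | m     | 10 m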

“The digital maps and relevant data used in the findings of this study were obtained from the governmental bodies in Turkey that are not publicly available. So, the data used in this study cannot be made available.”

What is an Ensemble Model?

Two Types of Ensemble Models

  • Option 1: Between different Machine Learning algorithms
  • Option 2: Within Machine Learning algorithms
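As a minimal sketch of how either option can combine its members, the snippet below averages the flood probabilities predicted by several models (often called soft voting). The model names and probability values are illustrative assumptions, not outputs from the paper, and the paper's own combination rule may differ.

```python
from statistics import mean


def soft_vote(probabilities_by_model):
    """Average per-sample probabilities across models (soft voting)."""
    # zip(*...) groups the i-th prediction of every model together.
    return [mean(sample) for sample in zip(*probabilities_by_model.values())]


def to_labels(probabilities, threshold=0.5):
    """Turn averaged probabilities into flood (1) / non-flood (0) labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical flood probabilities for four locations from three models.
predictions = {
    "ann": [0.9, 0.2, 0.6, 0.4],
    "svm": [0.8, 0.3, 0.4, 0.1],
    "gbt": [0.7, 0.1, 0.8, 0.4],
}

ensemble_probabilities = soft_vote(predictions)
ensemble_labels = to_labels(ensemble_probabilities)
print(ensemble_labels)
```

For Option 2, the dictionary would instead hold several fitted versions of one algorithm, for example the same model with different hyperparameters.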

Selected Machine Learning algorithms

  • Artificial Neural Networks
  • Support Vector Machine
  • Decision Trees
  • Gradient Boosting Trees

  • Tuneable
  • Available
  • k-fold cross-validation
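The k-fold cross-validation idea can be sketched in a few lines: the data are split into k folds, and each fold serves once as the validation set while the rest are used for training. The helper name `k_fold_indices` and the sample counts are assumptions for illustration; this sketch does not shuffle the data, which a real workflow usually would.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```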

Artificial Neural Network

\[z = w_1x_1 + w_2x_2 + \ldots + w_nx_n + b\]

  • \(w_i\): Weight associated with input \(x_i\)
  • \(b\): Bias term

\[a = f(z)\]

  • \(f\): Activation function
  • \(a\): Output of the neuron

The network is trained using algorithms such as backpropagation, where the model’s predictions are compared to the actual target values, and the error is used to adjust the weights and biases iteratively.
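The two equations and the training idea can be illustrated with a single neuron. This is a minimal sketch assuming a sigmoid activation and a squared-error loss, neither of which is specified in the slides; the input values, weights, and learning rate are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute a = f(z) with z = w_1 x_1 + ... + w_n x_n + b."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def gradient_step(inputs, weights, bias, target, learning_rate=1.0):
    """One gradient-descent update for the loss 0.5 * (a - target)^2."""
    a = neuron_output(inputs, weights, bias)
    # Chain rule: dLoss/dz = (a - target) * sigmoid'(z), and sigmoid'(z) = a * (1 - a).
    delta = (a - target) * a * (1.0 - a)
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias


inputs, weights, bias = [1.0, 2.0], [0.5, -0.25], 0.0
before = neuron_output(inputs, weights, bias)
weights, bias = gradient_step(inputs, weights, bias, target=1.0)
after = neuron_output(inputs, weights, bias)
print(before, after)
```

Here z starts at 0, so the first output is 0.5; after one update toward the target of 1, the output moves closer to 1.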

Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning model used for classification and regression tasks. It finds the best hyperplane that separates the classes while maximizing the margin.

Optimization Objective:

  • Maximize \(M\), the width of the margin, over \(\beta_0, \ldots, \beta_p\) and \(\epsilon_1, \ldots, \epsilon_n\)

Subject to:

  • \(\sum_{j=1}^{p} \beta_j^2 = 1\)
  • \(y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}) \geq M(1 - \epsilon_i)\)
  • \(\epsilon_i \geq 0\)
  • \(\sum_{i=1}^{n} \epsilon_i \leq C\)

Handling Non-linear Separability:

  • Linear Kernel: \(K(x, y) = x^T y\)
  • Polynomial Kernel: \(K(x, y) = (1 + x^T y)^d\)
  • Sigmoid Kernel: \(K(x, y) = \tanh(\kappa x^T y + c)\)
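The three kernels listed above translate directly into code. The sketch below uses plain Python lists as vectors; the default values for \(d\), \(\kappa\), and \(c\) are arbitrary choices for illustration.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    return (1 + dot(x, y)) ** d


def sigmoid_kernel(x, y, kappa=1.0, c=0.0):
    return math.tanh(kappa * dot(x, y) + c)


x, y = [1.0, 2.0], [3.0, 0.5]  # x^T y = 4
print(linear_kernel(x, y), polynomial_kernel(x, y), sigmoid_kernel(x, y))
```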

Decision Trees

  • Operation: Approximates the training set with if-then-else decision rules.
  • Depth and Complexity: Deeper trees have more complex decision rules, which fit the training data more closely but raise the risk of overfitting.
  • Partitioning: Iteratively divides the feature space, grouping samples with similar characteristics or target values.
  • Split Criteria: Uses cross-entropy, the Gini index, or misclassification error to choose splits for classification.
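The three classification measures named above can be computed from the class labels in a node. The sketch below also scores one candidate split by its size-weighted impurity; the labels (1 = flood, 0 = non-flood) are invented for illustration.

```python
import math
from collections import Counter


def class_proportions(labels):
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]


def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


def cross_entropy(labels):
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def misclassification_error(labels):
    return 1.0 - max(class_proportions(labels))


def split_impurity(left, right, measure=gini):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0]  # hypothetical split on one feature
print(gini(parent), split_impurity(left, right))
```

A tree-growing algorithm would evaluate many candidate splits this way and keep the one with the lowest weighted impurity.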

Gradient Boosting Trees

  • Basic Idea: Combine multiple weak learners (typically decision trees) to create a stronger learner.
  • Sequential Learning: Each tree is built to correct the errors of the previous one.
  • Gradient Descent: Each new tree adjusts the model in the direction that decreases the error the most, that is, along the negative gradient of the loss.
  • Shrinkage: The predictions of each tree are multiplied by a factor between 0 and 1. This slows down the learning process, leading to more robust models.
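The four ideas above fit into a short from-scratch sketch. To keep it small, it assumes a single feature, depth-one trees (stumps), and a squared-error loss, for which the negative gradient is simply the residual. The paper's models are classifiers built with library implementations, so this only illustrates the mechanism; the data are made up.

```python
def fit_stump(xs, residuals):
    """Fit the best single-threshold regression stump on one feature."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_trees=50, shrinkage=0.1):
    """Sequentially fit stumps to residuals, scaling each by the shrinkage."""
    base = sum(ys) / len(ys)
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(n_trees):
        # For squared error, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + shrinkage * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(model(2), model(5))
```

With a shrinkage of 0.1, each tree removes only a tenth of the remaining error, so many small corrections accumulate instead of one large one.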

Ensemble Process

Using Auto-Sklearn in Python:

  1. Data Preprocessing:
    • Scaling inputs
    • Imputing missing values
    • Encoding categorical factors
    • Encoding and balancing target classes
  2. Feature Preprocessing:
    • Feature selection: select percentile, select rate
    • Kernel approximation: Nystroem sampler, random kitchen sinks
    • Matrix decomposition: Principal Component Analysis, kernel PCA, fast Independent Component Analysis
    • Embeddings: Random tree embedding
    • Feature clustering: Feature agglomeration
    • Polynomial feature expansion: Polynomial feature
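The data preprocessing steps in item 1 can be illustrated by hand. Auto-Sklearn selects and configures such steps automatically, so the sketch below is not its API; the helper names, the choice of mean imputation, and the sample values are assumptions for illustration only.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def standardize(values):
    """Scale values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(categories):
    """Encode each category as a 0/1 vector over the sorted category levels."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]


# Hypothetical samples: one numeric feature with a gap, one categorical feature.
elevation_m = [120.0, None, 80.0, 100.0]
soil_group = ["alluvial", "colluvial", "alluvial", "brown forest"]

elevation_scaled = standardize(impute_mean(elevation_m))
soil_encoded = one_hot(soil_group)
print(elevation_scaled)
print(soil_encoded)
```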

Ensemble Process

  1. Tune each model
  2. Select the model with the best characteristics:
    • High Area Under the Receiver Operating Characteristic curve (AUC)
    • Ideally, model results are uncorrelated

McNemar Test: Compares outcomes of two models using a \(\chi^2\) test

              Model B
              Yes   No
          -------------------
Model A Yes |  a  |  b  |
          -------------------
        No  |  c  |  d  |
          -------------------

The test statistic for the McNemar test is calculated from the discordant counts \(b\) and \(c\), the cases where the two models disagree:

\(\chi^2 = \frac{(b-c)^2}{b+c}\)
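A small sketch of the calculation follows. It uses the statistic exactly as written above (no continuity correction) and obtains the p-value from the \(\chi^2\) distribution with one degree of freedom, whose upper tail equals \(\mathrm{erfc}(\sqrt{\chi^2 / 2})\). The counts \(b = 15\) and \(c = 5\) are invented for illustration.

```python
import math


def mcnemar(b, c):
    """Return (chi-square statistic, p-value) without continuity correction."""
    if b + c == 0:
        # The models never disagree, so there is no evidence of a difference.
        return 0.0, 1.0
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = mcnemar(b=15, c=5)
print(statistic, p_value)
```

A small p-value suggests the two models make systematically different errors, which is the property an ensemble benefits from.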

Evaluation




Model | Single (AUC) | Ensemble (AUC)
------|--------------|---------------
SVM   | 0.854        | 0.877
ANN   | 0.834        | 0.936
GBT   | 0.889        | 0.922
DT    | 0.868        | 0.910
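For readers unfamiliar with the metric, AUC is the probability that a randomly chosen flood point receives a higher score than a randomly chosen non-flood point. The sketch below computes it by direct pairwise comparison; the labels and scores are made up and unrelated to the values in the table.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(auc)
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9.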

Evaluation

Model Type   | Correctly Predicted Flooding Points
-------------|------------------------------------
Single ANN   | 36/47
Ensemble ANN | 41/47
Single DT    | 38/47
Ensemble DT  | 42/47
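The counts in the table convert to hit rates with simple arithmetic, as this short sketch shows. Only the counts come from the slides; the variable names are illustrative.

```python
# Correctly predicted flooding points out of 47, as reported in the table.
hits = {"Single ANN": 36, "Ensemble ANN": 41, "Single DT": 38, "Ensemble DT": 42}
total_flood_points = 47

hit_rates = {name: count / total_flood_points for name, count in hits.items()}
for name, rate in hit_rates.items():
    print(f"{name}: {rate:.1%}")
```

The ensemble versions capture five more flooding points than the single ANN and four more than the single DT.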

Performance

Conclusion

  • 🌊 Predicting potential flood hazard areas is more than a technical challenge; it’s about safeguarding communities and economies.
  • 🎯 The authors' ensemble models, built on advanced ML algorithms, identified these areas accurately, reaching AUC values of 0.877 to 0.936 and outperforming every corresponding single model.
  • 🤝 By integrating these insights, decision-makers at all levels can devise more effective flood and basin management strategies.
  • 📚 My goal today was to distill and convey the significant findings of this research.